CoFormerNet: A Transformer-Based Fusion Approach for Enhanced Vehicle-Infrastructure Cooperative Perception

Bin Li, Yanan Zhao, Huachun Tan
DOI: https://doi.org/10.3390/s24134101
IF: 3.9
2024-06-24
Sensors
Abstract: Vehicle–infrastructure cooperative perception is becoming increasingly crucial for autonomous driving systems and involves leveraging infrastructure's broader spatial perspective and computational resources. This paper introduces CoFormerNet, which is a novel framework for improving cooperative perception. CoFormerNet employs a consistent structure for both vehicle and infrastructure branches, integrating the temporal aggregation module and spatial-modulated cross-attention to fuse intermediate features at two distinct stages. This design effectively handles communication delays and spatial misalignment. Experimental results using the DAIR-V2X and V2XSet datasets demonstrated that CoFormerNet significantly outperformed the existing methods, achieving state-of-the-art performance in 3D object detection.
Categories: engineering, electrical & electronic; instruments & instrumentation; chemistry, analytical
What problem does this paper attempt to address?
The paper addresses vehicle-infrastructure cooperative perception in autonomous driving systems, particularly in traffic scenarios such as blind spots at intersections and T-junctions and long-distance obstacle detection. It proposes CoFormerNet, a Transformer-based fusion framework that handles communication delays and spatial misalignment by fusing intermediate features at two different stages, and that uses the infrastructure's computational resources and broader perspective to extend the vehicle's perception range. The major contributions are:

1. **Historical Temporal Information Fusion**: The Temporal Aggregation Module (TAM) integrates historical temporal information from infrastructure sensors, making use of the infrastructure's computational power and long-distance global perspective to extend the vehicle's perception field in both time and space.
2. **Two-Stage Intermediate Feature Fusion**: Intermediate features are first fused after Bird's Eye View (BEV) feature extraction, and then fused again at the decoding layer using Spatial-Modulated Cross-Attention (SMCA), which alleviates the vehicle-infrastructure sensor calibration issues caused by hard association strategies.
3. **End-to-End Training**: The design is trained end to end, so a single model covers all possible communication delays. This reduces model complexity, improves performance, and makes the model more practical in real-world environments.

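The paper's implementation is not reproduced here, but the general idea behind spatially modulated cross-attention can be sketched briefly. In the minimal pure-Python sketch below, the content-based score between a query and each key is biased by the log of a Gaussian centered on the query's estimated BEV position, so nearby feature cells dominate the fusion without a hard one-to-one association. The Gaussian form of the prior, the `sigma` parameter, the function names, and the reading of queries as vehicle-side object queries and keys/values as infrastructure BEV cells are all assumptions made for illustration, not details taken from CoFormerNet.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def spatially_modulated_cross_attention(
    queries, query_centers, keys, values, key_positions, sigma=1.0
):
    """Single-head cross-attention with a Gaussian spatial prior (illustrative).

    queries:        list of d-dimensional vectors (hypothetical object queries)
    query_centers:  list of (x, y) estimated BEV positions, one per query
    keys, values:   lists of vectors, one per BEV feature cell
    key_positions:  list of (x, y) BEV positions, one per feature cell
    sigma:          width of the assumed Gaussian prior, in BEV units

    Returns (outputs, weights): the attended vectors and the attention
    weights per query.
    """
    dim = len(queries[0])
    outputs, all_weights = [], []
    for query, (cx, cy) in zip(queries, query_centers):
        logits = []
        for key, (kx, ky) in zip(keys, key_positions):
            # Scaled dot-product content score.
            content = sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            # Log of an unnormalized Gaussian centered on the query position.
            dist_sq = (kx - cx) ** 2 + (ky - cy) ** 2
            spatial_bias = -dist_sq / (2.0 * sigma ** 2)
            logits.append(content + spatial_bias)
        weights = softmax(logits)
        # Weighted sum of the value vectors.
        attended = [
            sum(w * value[j] for w, value in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(attended)
        all_weights.append(weights)
    return outputs, all_weights
```

With a small `sigma`, attention concentrates on cells close to the query center; as `sigma` grows, the spatial bias vanishes and the result approaches ordinary cross-attention. A wider prior therefore trades localization sharpness for tolerance to spatial misalignment.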
Experiments on the DAIR-V2X and V2XSet datasets show that CoFormerNet significantly outperforms existing methods and achieves state-of-the-art performance in 3D object detection, demonstrating its value for enhancing the perception capabilities of autonomous driving systems.