Abstract: Vehicle–infrastructure cooperative perception is becoming increasingly crucial for autonomous driving systems because it leverages the infrastructure's broader spatial perspective and computational resources. This paper introduces CoFormerNet, a novel framework for improving cooperative perception. CoFormerNet employs a consistent structure for both the vehicle and infrastructure branches, integrating a temporal aggregation module and spatial-modulated cross-attention to fuse intermediate features at two distinct stages. This design effectively handles communication delays and spatial misalignment. Experimental results on the DAIR-V2X and V2XSet datasets demonstrated that CoFormerNet significantly outperformed existing methods, achieving state-of-the-art performance in 3D object detection.

What problem does this paper attempt to address?

The paper addresses vehicle-infrastructure cooperative perception in autonomous driving systems, particularly in traffic scenarios such as blind spots at intersections and T-junctions and obstacle detection, including at long distances. Specifically, it proposes CoFormerNet, a Transformer-based fusion framework designed to enhance vehicle-infrastructure cooperative perception. The framework handles communication delays and spatial misalignment by fusing intermediate features at two different stages, and it leverages the computational resources and broader perspective of the infrastructure to extend the vehicle's perception range.

The design of CoFormerNet includes the following major contributions:

1. **Historical Time Information Fusion**: The Temporal Aggregation Module (TAM) integrates historical time information from infrastructure sensors. This makes full use of the infrastructure's computational power and long-distance global perspective, extending the vehicle's perception field in both time and space.
2. **Two-Stage Intermediate Feature Fusion**: First, intermediate features are fused after Bird's Eye View (BEV) feature extraction. Second, fusion is performed at the decoding layer using the Spatial-Modulated Cross-Attention (SMCA) mechanism, to alleviate the vehicle-infrastructure sensor calibration problems that arise with hard association strategies.
3. **End-to-End Training**: The design allows end-to-end training and covers all possible communication delays with a single model, which significantly reduces model complexity, makes the model easier to optimize, and improves its effectiveness in real-world environments.

Experimental results show that CoFormerNet significantly outperforms existing methods on the DAIR-V2X and V2XSet datasets and achieves state-of-the-art performance in 3D object detection.
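
To make the second fusion stage more concrete, the following is a minimal NumPy sketch of cross-attention with a spatial prior, in which each object query attends to flattened BEV features and the attention logits are biased toward a predicted center. It is an illustration only, not the paper's implementation: the function name, the isotropic Gaussian prior, the `sigma` parameter, and the toy shapes are all assumptions, and a real model would use learned projections and multiple heads.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def spatially_modulated_cross_attention(queries, keys, values, key_xy, centers, sigma=2.0):
    """Hypothetical single-head cross-attention with a Gaussian spatial prior.

    queries: (Q, D) object queries (e.g. from the vehicle branch)
    keys:    (N, D) flattened BEV features (e.g. from the infrastructure branch)
    values:  (N, D) features to aggregate
    key_xy:  (N, 2) BEV grid coordinates of each key
    centers: (Q, 2) predicted object center for each query
    sigma:   assumed spatial scale of the prior, in grid cells
    """
    d = queries.shape[-1]
    # Standard scaled dot-product logits, shape (Q, N).
    logits = queries @ keys.T / np.sqrt(d)
    # Squared distance from each query's center to each key location, shape (Q, N).
    sq_dist = ((centers[:, None, :] - key_xy[None, :, :]) ** 2).sum(axis=-1)
    # Log of an unnormalized Gaussian: a soft spatial bias rather than a hard match.
    log_prior = -sq_dist / (2.0 * sigma ** 2)
    weights = softmax(logits + log_prior, axis=-1)
    return weights @ values, weights


# Toy usage on an 8x8 BEV grid with 16-dimensional features.
rng = np.random.default_rng(0)
H, W, D = 8, 8, 16
ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
key_xy = np.stack([xs.ravel(), ys.ravel()], axis=-1).astype(float)
keys = rng.normal(size=(H * W, D))
values = rng.normal(size=(H * W, D))
queries = rng.normal(size=(3, D))
centers = np.array([[1.0, 1.0], [4.0, 4.0], [6.0, 2.0]])

fused, weights = spatially_modulated_cross_attention(
    queries, keys, values, key_xy, centers
)
print(fused.shape, weights.shape)
```

Because the prior only shifts the logits, features slightly away from the predicted center can still receive weight, which is the intuition behind preferring a soft spatial bias over a hard association when vehicle and infrastructure sensors are imperfectly calibrated.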

CoFormerNet: A Transformer-Based Fusion Approach for Enhanced Vehicle-Infrastructure Cooperative Perception
