V2VFormer++: Multi-Modal Vehicle-to-Vehicle Cooperative Perception Via Global-Local Transformer

Hongbo Yin, Daxin Tian, Chunmian Lin, Xuting Duan, Jianshan Zhou, Dezong Zhao, Dongpu Cao
DOI: https://doi.org/10.1109/tits.2023.3314919
IF: 8.5
2024-01-01
IEEE Transactions on Intelligent Transportation Systems
Abstract: Multi-vehicle cooperative perception has recently emerged to facilitate the long-range and large-scale perception ability of connected automated vehicles (CAVs). Nonetheless, much existing work formulates collaborative perception as a LiDAR-only 3D detection paradigm, neglecting the significance and complementarity of dense images. In this work, we construct the first multi-modal vehicle-to-vehicle cooperative perception framework, dubbed V2VFormer++, where individual camera-LiDAR representations are combined by dynamic channel fusion (DCF) in bird’s-eye-view (BEV) space, and ego-centric BEV maps from adjacent vehicles are aggregated by a global-local transformer module. Specifically, a channel-token mixer (CTM) with an MLP design is developed to capture the global response among neighboring CAVs, and position-aware fusion (PAF) further investigates the spatial correlation between ego-networked maps from a local perspective. In this manner, we can strategically determine which CAVs are desirable for collaboration and how to aggregate the most important information from them. Quantitative and qualitative experiments are conducted on the publicly available OPV2V and V2X-Sim 2.0 benchmarks, and the proposed V2VFormer++ achieves state-of-the-art cooperative perception performance, demonstrating its effectiveness. Moreover, ablation studies and visualization analysis further suggest strong robustness against diverse disturbances from real-world scenarios.