IS-Fusion: Instance-Scene Collaborative Fusion for Multimodal 3D Object Detection

Junbo Yin,Jianbing Shen,Runnan Chen,Wei Li,Ruigang Yang,Pascal Frossard,Wenguan Wang
2024-03-22
Abstract: Bird's eye view (BEV) representation has emerged as a dominant solution for describing 3D space in autonomous driving scenarios. However, objects in the BEV representation typically exhibit small sizes, and the associated point cloud context is inherently sparse, which leads to great challenges for reliable 3D perception. In this paper, we propose IS-Fusion, an innovative multimodal fusion framework that jointly captures the Instance- and Scene-level contextual information. IS-Fusion essentially differs from existing approaches that only focus on the BEV scene-level fusion by explicitly incorporating instance-level multimodal information, thus facilitating the instance-centric tasks like 3D object detection. It comprises a Hierarchical Scene Fusion (HSF) module and an Instance-Guided Fusion (IGF) module. HSF applies Point-to-Grid and Grid-to-Region transformers to capture the multimodal scene context at different granularities. IGF mines instance candidates, explores their relationships, and aggregates the local multimodal context for each instance. These instances then serve as guidance to enhance the scene feature and yield an instance-aware BEV representation. On the challenging nuScenes benchmark, IS-Fusion outperforms all the published multimodal works to date. Code is available at:
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses 3D object detection in autonomous driving scenarios. Methods built on the bird's eye view (BEV) representation face two difficulties: objects typically occupy only a small area in BEV, and the associated point cloud context is inherently sparse. Both make reliable 3D perception hard. To address this, the paper proposes IS-Fusion, a multimodal fusion framework. Whereas existing approaches focus only on scene-level fusion in BEV, IS-Fusion also explicitly incorporates instance-level multimodal information, which better supports instance-centric tasks such as 3D object detection. The framework comprises two modules: Hierarchical Scene Fusion (HSF) and Instance-Guided Fusion (IGF). HSF applies Point-to-Grid and Grid-to-Region transformers to capture multimodal scene context at different granularities. IGF mines instance candidates, explores their relationships, and aggregates the local multimodal context for each instance. These instances then serve as guidance to enhance the scene features, yielding an instance-aware BEV representation. On the nuScenes benchmark, the authors report that IS-Fusion outperforms all multimodal works published to date.
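The instance-guided idea described above can be illustrated with a small NumPy sketch. This is not the authors' implementation, only a toy version under stated assumptions: the fused BEV feature map and an objectness heatmap are taken as given, candidates are simply the top-k heatmap cells, and both attention steps use plain dot-product attention with no learned projections. The per-instance aggregation of local multimodal context is omitted. The names `select_instance_candidates` and `instance_guided_enhance` are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def select_instance_candidates(heatmap, k):
    # Assumption: candidates are the k highest-scoring BEV cells.
    flat = np.argsort(heatmap.ravel())[::-1][:k]
    rows, cols = np.unravel_index(flat, heatmap.shape)
    return np.stack([rows, cols], axis=1)  # (k, 2), best first


def instance_guided_enhance(bev, heatmap, k=4):
    # bev: (H, W, C) fused scene feature; heatmap: (H, W) objectness scores.
    H, W, C = bev.shape
    idx = select_instance_candidates(heatmap, k)
    inst = bev[idx[:, 0], idx[:, 1]]  # (k, C) instance features

    # Instance-to-instance attention: model relationships among candidates.
    rel = softmax(inst @ inst.T / np.sqrt(C))
    inst = inst + rel @ inst

    # Scene-to-instance attention: every BEV cell attends to the instances,
    # and the result is added back as a residual.
    scene = bev.reshape(-1, C)
    guide = softmax(scene @ inst.T / np.sqrt(C))  # (H*W, k)
    enhanced = scene + guide @ inst
    return enhanced.reshape(H, W, C), idx


# Toy usage: a 4x4 BEV grid with 8 channels and two clear heatmap peaks.
rng = np.random.default_rng(0)
bev = rng.normal(size=(4, 4, 8))
heatmap = np.zeros((4, 4))
heatmap[1, 2] = 0.9
heatmap[3, 0] = 0.8
enhanced, idx = instance_guided_enhance(bev, heatmap, k=2)
print(enhanced.shape, idx.tolist())
```

The residual form keeps the original scene feature intact and only adds instance-derived context on top, which is one simple way to obtain an "instance-aware" BEV map. A real model would use learned query, key, and value projections and multimodal inputs.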