IS-Fusion: Instance-Scene Collaborative Fusion for Multimodal 3D Object Detection

Junbo Yin, Jianbing Shen, Runnan Chen, Wei Li, Ruigang Yang, Pascal Frossard, Wenguan Wang

2024-03-22

Abstract: Bird's eye view (BEV) representation has emerged as a dominant solution for describing 3D space in autonomous driving scenarios. However, objects in the BEV representation typically exhibit small sizes, and the associated point cloud context is inherently sparse, which leads to great challenges for reliable 3D perception. In this paper, we propose IS-Fusion, an innovative multimodal fusion framework that jointly captures the Instance- and Scene-level contextual information. IS-Fusion essentially differs from existing approaches that only focus on the BEV scene-level fusion by explicitly incorporating instance-level multimodal information, thus facilitating the instance-centric tasks like 3D object detection. It comprises a Hierarchical Scene Fusion (HSF) module and an Instance-Guided Fusion (IGF) module. HSF applies Point-to-Grid and Grid-to-Region transformers to capture the multimodal scene context at different granularities. IGF mines instance candidates, explores their relationships, and aggregates the local multimodal context for each instance. These instances then serve as guidance to enhance the scene feature and yield an instance-aware BEV representation. On the challenging nuScenes benchmark, IS-Fusion outperforms all the published multimodal works to date. Code is available at:

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses 3D object detection in autonomous driving scenarios. Methods built on the bird's eye view (BEV) representation face two difficulties: objects typically occupy only a small area in BEV, and the associated point cloud context is inherently sparse. Both make reliable 3D perception hard. To address this, the paper proposes IS-Fusion, a multimodal fusion framework. Unlike existing approaches that fuse modalities only at the BEV scene level, IS-Fusion also incorporates instance-level multimodal information, which better supports instance-centric tasks such as 3D object detection.

The framework has two key modules: Hierarchical Scene Fusion (HSF) and Instance-Guided Fusion (IGF). HSF applies Point-to-Grid and Grid-to-Region transformers to capture the multimodal scene context at different granularities. IGF mines instance candidates, explores their relationships, and aggregates the local multimodal context for each instance. These instances then serve as guidance to enhance the scene features, yielding an instance-aware BEV representation. On the nuScenes benchmark, IS-Fusion is reported to outperform all multimodal 3D detection works published at the time.
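To make the instance-guided idea concrete, here is a minimal sketch in plain Python. It is not the authors' IGF module: the paper describes attention-based components, whereas this toy version substitutes top-k selection on a score map, mean pooling over a local window, and a residual add. The function names (`select_instances`, `local_context`, `instance_guided_enhance`) and the parameters `k` and `radius` are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

Grid = List[List[List[float]]]   # H x W x C BEV feature map (assumed layout)
ScoreMap = List[List[float]]     # H x W per-cell "objectness" scores (assumed input)


def select_instances(scores: ScoreMap, k: int) -> List[Tuple[int, int]]:
    """Return (row, col) of the k highest-scoring BEV cells as instance candidates."""
    cells = [
        (scores[r][c], r, c)
        for r in range(len(scores))
        for c in range(len(scores[0]))
    ]
    # Stable sort by descending score; ties keep row-major order.
    cells.sort(key=lambda t: -t[0])
    return [(r, c) for _, r, c in cells[:k]]


def local_context(bev: Grid, center: Tuple[int, int], radius: int) -> List[float]:
    """Mean-pool features in a (2*radius+1)^2 window around center, clipped at borders."""
    h, w, ch = len(bev), len(bev[0]), len(bev[0][0])
    r0, c0 = center
    acc = [0.0] * ch
    n = 0
    for r in range(max(0, r0 - radius), min(h, r0 + radius + 1)):
        for c in range(max(0, c0 - radius), min(w, c0 + radius + 1)):
            for i in range(ch):
                acc[i] += bev[r][c][i]
            n += 1
    return [a / n for a in acc]


def instance_guided_enhance(bev: Grid, scores: ScoreMap, k: int = 2, radius: int = 1) -> Grid:
    """Add each candidate's pooled local context back to the scene feature at its cell."""
    out = [[list(cell) for cell in row] for row in bev]  # copy; input is left untouched
    for r, c in select_instances(scores, k):
        ctx = local_context(bev, (r, c), radius)
        out[r][c] = [f + g for f, g in zip(bev[r][c], ctx)]
    return out


if __name__ == "__main__":
    bev = [[[1.0] for _ in range(3)] for _ in range(3)]
    scores = [[0.0, 0.1, 0.0], [0.2, 0.9, 0.0], [0.0, 0.0, 0.0]]
    enhanced = instance_guided_enhance(bev, scores, k=1, radius=1)
    print(enhanced[1][1], enhanced[0][0])
```

The sketch only shows the data flow the summary describes: candidates are picked from the scene, each one gathers context from its neighborhood, and the result is written back so the BEV map becomes instance-aware. A real implementation would operate on fused LiDAR and camera features and would model relationships between instances, which this version omits.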

BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird's-Eye View Representation

Multi-Sem Fusion: Multimodal Semantic Fusion for 3D Object Detection

PoIFusion: Multi-Modal 3D Object Detection via Fusion at Points of Interest

mmFUSION: Multimodal Fusion for 3D Objects Detection

Cascade fusion of multi-modal and multi-source feature fusion by the attention for three-dimensional object detection

Center Feature Fusion: Selective Multi-Sensor Fusion of Center-based Objects

Multi-Sem Fusion: Multimodal Semantic Fusion for 3-D Object Detection

MV2DFusion: Leveraging Modality-Specific Object Semantics for Multi-Modal 3D Detection

Fully Sparse Fusion for 3D Object Detection

FusionPainting: Multimodal Fusion with Adaptive Attention for 3D Object Detection

EPAWFusion: multimodal fusion for 3D object detection based on enhanced points and adaptive weights

Deep multi-scale and multi-modal fusion for 3D object detection

Enhancing 3D object detection through multi-modal fusion for cooperative perception

DyFusion: Cross-Attention 3D Object Detection with Dynamic Fusion

Eliminating Cross-modal Conflicts in BEV Space for LiDAR-Camera 3D Object Detection

SparseFusion: Efficient Sparse Multi-Modal Fusion Framework for Long-Range 3D Perception

FSD-BEV: Foreground Self-Distillation for Multi-view 3D Object Detection

Dense projection fusion for 3D object detection

DeployFusion: A Deployable Monocular 3D Object Detection with Multi-Sensor Information Fusion in BEV for Edge Devices