Abstract: Multimodal fusion can make semantic segmentation more robust. However, fusing an arbitrary number of modalities remains underexplored. To delve into this problem, we create the DeLiVER arbitrary-modal segmentation benchmark, covering Depth, LiDAR, multiple Views, Events, and RGB. In addition, we provide this dataset in four severe weather conditions as well as five sensor failure cases to exploit modal complementarity and resolve partial outages. To make this possible, we present the arbitrary cross-modal segmentation model CMNeXt. It encompasses a Self-Query Hub (SQ-Hub) designed to extract effective information from any modality for subsequent fusion with the RGB representation, and it adds only a negligible number of parameters (~0.01M) per additional modality. Furthermore, to efficiently and flexibly harvest discriminative cues from the auxiliary modalities, we introduce the simple Parallel Pooling Mixer (PPX). In extensive experiments on a total of six benchmarks, our CMNeXt achieves state-of-the-art performance on the DeLiVER, KITTI-360, MFNet, NYU Depth V2, UrbanLF, and MCubeS datasets, and scales from 1 to 81 modalities. On the freshly collected DeLiVER, the quad-modal CMNeXt reaches up to 66.30% in mIoU, a +9.10% gain compared to the mono-modal baseline. The DeLiVER dataset and our code are at: https://jamycheung.github.io/DELIVER.html
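The abstract reports results in mIoU (mean intersection over union). As a rough illustration of the metric, the sketch below computes it on a tiny flattened label map. The function name `mean_iou`, the toy labels, and the choice to skip classes absent from both prediction and ground truth are assumptions made for this example, not details taken from the paper or its code. The last lines work out the baseline score implied by the reported numbers, assuming the +9.10% gain is meant in absolute percentage points.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes that occur in the prediction or the ground truth.

    pred and target are flat sequences of integer class labels of equal length.
    Skipping classes absent from both is an assumption of this sketch.
    """
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example: six pixels, three classes (hypothetical data).
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, num_classes=3)

# Baseline implied by the abstract, assuming the gain is in absolute points.
quad_modal_miou = 66.30
reported_gain = 9.10
implied_baseline = round(quad_modal_miou - reported_gain, 2)
```

Here the per-class IoUs are 1/3, 2/3, and 1/2, so `score` averages to 0.5, and `implied_baseline` comes out at 57.2.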

What problem does this paper attempt to address?

This paper addresses multi-modal fusion for semantic segmentation, in particular the fusion of an arbitrary number of modalities. The authors point out that most current multi-modal fusion methods target specific sensor combinations, and that fusing larger numbers of modalities has received little study. Existing methods also perform poorly under partial sensor failure, which is a common problem in real robotic systems. To address these problems, the authors make three contributions:

1. **Create a new benchmark dataset**: The authors build a new multi-modal dataset named DeLiVER with the CARLA simulator. It contains depth, LiDAR, multi-view, event, and RGB data, and covers four severe weather conditions and five sensor failure modes, so that modal complementarity and partial sensor failure can be studied.
2. **Propose a new multi-modal fusion framework**: The authors propose an arbitrary cross-modal segmentation model named CMNeXt, which adopts a novel Hub2Fuse paradigm with an asymmetric dual-branch structure. One branch processes RGB images, and the other processes multiple auxiliary modalities. The Self-Query Hub dynamically selects useful features from the auxiliary modalities, and the Parallel Pooling Mixer fuses them efficiently.
3. **Verify the effectiveness of the model**: The authors run extensive experiments on six datasets: DeLiVER, KITTI-360, MFNet, NYU Depth V2, UrbanLF, and MCubeS. The results show that CMNeXt achieves state-of-the-art performance on these datasets, especially for multi-modal fusion and under sensor failures.

In conclusion, the paper targets the shortcomings of existing methods in handling an arbitrary number of modalities and partial sensor failures. It does so by creating a new dataset and proposing a new multi-modal fusion framework, thereby improving the robustness and accuracy of semantic segmentation.
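To make the asymmetric "select from the auxiliary branch, then fuse with RGB" idea more concrete, here is a deliberately simplified toy sketch. It is not the CMNeXt implementation: the sigmoid scoring, the per-position maximum across modalities, the additive fusion, and all function names (`self_query_scores`, `hub_select`, `fuse`) are assumptions chosen for illustration, and the Parallel Pooling Mixer is not modeled at all. The sketch only shows how one selection step can accept any number of auxiliary feature vectors and how a dead sensor (all zeros) can drop out of the result.

```python
import math


def self_query_scores(feature):
    """Hypothetical stand-in for a learned query: sigmoid of each activation."""
    return [1.0 / (1.0 + math.exp(-x)) for x in feature]


def hub_select(aux_features):
    """Per position, keep the strongest score-weighted response across modalities."""
    weighted = [
        [s * x for s, x in zip(self_query_scores(f), f)]
        for f in aux_features
    ]
    # zip(*weighted) groups the values of all modalities at the same position.
    return [max(column) for column in zip(*weighted)]


def fuse(rgb_feature, aux_features):
    """Add the selected auxiliary feature to the RGB feature (assumed fusion rule)."""
    if not aux_features:
        return list(rgb_feature)  # RGB-only fallback when no auxiliary input exists
    selected = hub_select(aux_features)
    return [r + a for r, a in zip(rgb_feature, selected)]


# Toy feature vectors (hypothetical values); the LiDAR sensor has failed.
rgb = [1.0, 1.0]
depth = [2.0, 0.0]
failed_lidar = [0.0, 0.0]

fused_one = fuse(rgb, [depth])
fused_two = fuse(rgb, [depth, failed_lidar])
```

With these non-negative toy values, the failed sensor contributes only zeros, so `fused_two` equals `fused_one`. A learned model would not be guaranteed to behave this cleanly.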

Delivering Arbitrary-Modal Semantic Segmentation

Centering the Value of Every Modality: Towards Efficient and Resilient Modality-agnostic Semantic Segmentation

Robust 3D Semantic Segmentation Method Based on Multi-Modal Collaborative Learning

CMX: Cross-Modal Fusion for RGB-X Semantic Segmentation With Transformers

GAMUS: A Geometry-aware Multi-modal Semantic Segmentation Benchmark for Remote Sensing Data

Missing Modality Robustness in Semi-Supervised Multi-Modal Semantic Segmentation

Learning Modality-agnostic Representation for Semantic Segmentation from Any Modalities

Semantic Guidance Fusion Network for Cross-Modal Semantic Segmentation

CoMiX: Cross-Modal Fusion with Deformable Convolutions for HSI-X Semantic Segmentation

FusionSAM: Latent Space driven Segment Anything Model for Multimodal Fusion and Segmentation

U3M: Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation

Mitigating Modality Discrepancies for RGB-T Semantic Segmentation

CMDFusion: Bidirectional Fusion Network with Cross-modality Knowledge Distillation for LIDAR Semantic Segmentation

Multi-interactive Feature Learning and a Full-time Multi-modality Benchmark for Image Fusion and Segmentation

LMFNet: An Efficient Multimodal Fusion Approach for Semantic Segmentation in High-Resolution Remote Sensing

MultiSenseSeg: A Cost-Effective Unified Multimodal Semantic Segmentation Model for Remote Sensing

Customize Segment Anything Model for Multi-Modal Semantic Segmentation with Mixture of LoRA Experts

MSeg3D: Multi-modal 3D Semantic Segmentation for Autonomous Driving

Revisiting Multi-modal 3D Semantic Segmentation in Real-world Autonomous Driving

Joint Semantic Segmentation using representations of LiDAR point clouds and camera images