Delivering Arbitrary-Modal Semantic Segmentation

Jiaming Zhang, Ruiping Liu, Hao Shi, Kailun Yang, Simon Reiß, Kunyu Peng, Haodong Fu, Kaiwei Wang, Rainer Stiefelhagen
2023-03-03
Abstract: Multimodal fusion can make semantic segmentation more robust. However, fusing an arbitrary number of modalities remains underexplored. To delve into this problem, we create the DeLiVER arbitrary-modal segmentation benchmark, covering Depth, LiDAR, multiple Views, Events, and RGB. Aside from this, we provide this dataset in four severe weather conditions as well as five sensor failure cases to exploit modal complementarity and resolve partial outages. To make this possible, we present the arbitrary cross-modal segmentation model CMNeXt. It encompasses a Self-Query Hub (SQ-Hub) designed to extract effective information from any modality for subsequent fusion with the RGB representation and adds only negligible amounts of parameters (~0.01M) per additional modality. On top, to efficiently and flexibly harvest discriminative cues from the auxiliary modalities, we introduce the simple Parallel Pooling Mixer (PPX). With extensive experiments on a total of six benchmarks, our CMNeXt achieves state-of-the-art performance on the DeLiVER, KITTI-360, MFNet, NYU Depth V2, UrbanLF, and MCubeS datasets, allowing to scale from 1 to 81 modalities. On the freshly collected DeLiVER, the quad-modal CMNeXt reaches up to 66.30% in mIoU with a +9.10% gain as compared to the mono-modal baseline. The DeLiVER dataset and our code are at: https://jamycheung.github.io/DELIVER.html
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses multimodal fusion for semantic segmentation, in particular the fusion of an arbitrary number of modalities. The authors point out that most current fusion methods target specific sensor combinations, so fusing larger sets of modalities remains underexplored. Existing methods also perform poorly when some sensors fail, which is a common problem in real robotic systems. To address these gaps, the paper makes three contributions:
1. **A new benchmark dataset**: The authors build DeLiVER, a multimodal dataset generated with the CARLA simulator. It contains depth, LiDAR, multi-view, event, and RGB data, and covers four severe weather conditions and five sensor failure cases, so that modal complementarity can be exploited and partial sensor outages resolved.
2. **A new multimodal fusion framework**: The authors propose CMNeXt, an arbitrary cross-modal segmentation model that follows a novel Hub2Fuse paradigm with an asymmetric dual-branch structure. One branch processes RGB images and the other processes the auxiliary modalities. The Self-Query Hub (SQ-Hub) dynamically selects useful features from the auxiliary modalities, and the Parallel Pooling Mixer (PPX) efficiently harvests discriminative cues from them before fusion with the RGB representation.
3. **Verification of the model**: The authors run extensive experiments on six benchmarks: DeLiVER, KITTI-360, MFNet, NYU Depth V2, UrbanLF, and MCubeS. CMNeXt achieves state-of-the-art performance on all of them, with the clearest benefits in multimodal fusion and under sensor failures.
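To make the dual-branch idea concrete, here is a minimal sketch of how selection and pooling could fit together. It is not the authors' implementation: the function names, the per-pixel largest-magnitude selection rule, the 1-D feature vectors, the kernel sizes, and the additive fusion are all assumptions chosen for illustration.

```python
from typing import Dict, List, Sequence

Feature = List[float]  # toy stand-in: one value per pixel of a flattened feature map


def self_query_select(aux: Dict[str, Feature]) -> Feature:
    """Hypothetical hub-style selection: per pixel, keep the auxiliary
    response with the largest magnitude across all given modalities."""
    if not aux:
        raise ValueError("need at least one auxiliary modality")
    maps = list(aux.values())
    n = len(maps[0])
    if any(len(m) != n for m in maps):
        raise ValueError("all modalities must have the same length")
    return [max((m[i] for m in maps), key=abs) for i in range(n)]


def avg_pool_1d(x: Feature, k: int) -> Feature:
    """Same-length average pooling with an odd window k.
    Near the borders only the in-range part of the window is averaged."""
    r = k // 2
    out = []
    for i in range(len(x)):
        window = x[max(0, i - r): i + r + 1]
        out.append(sum(window) / len(window))
    return out


def parallel_pool_mix(x: Feature, kernels: Sequence[int] = (3, 7, 11)) -> Feature:
    """Run several pooling windows in parallel, average their outputs,
    and add the result back to the input as a residual."""
    pooled = [avg_pool_1d(x, k) for k in kernels]
    mixed = [sum(p[i] for p in pooled) / len(pooled) for i in range(len(x))]
    return [xi + mi for xi, mi in zip(x, mixed)]


def fuse(rgb: Feature, aux: Dict[str, Feature]) -> Feature:
    """Fuse the RGB branch with any number of auxiliary modalities.
    With no auxiliary input the RGB features pass through unchanged."""
    if not aux:
        return list(rgb)
    selected = parallel_pool_mix(self_query_select(aux))
    return [r + s for r, s in zip(rgb, selected)]
```

Because `fuse` takes a dictionary of auxiliary features, a failed sensor can be modelled by leaving its entry out, for example calling `fuse(rgb, {"depth": depth})` instead of `fuse(rgb, {"depth": depth, "lidar": lidar})`. The RGB branch stays the same whatever the modality count.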
In conclusion, the paper addresses the shortcomings of existing methods in handling an arbitrary number of modalities and partial sensor failures. It does so by creating a new dataset and proposing a new multimodal fusion framework, thereby improving the robustness and accuracy of semantic segmentation.