Efficient Multimodal Semantic Segmentation via Dual-Prompt Learning

Shaohua Dong, Yunhe Feng, Qing Yang, Yan Huang, Dongfang Liu, Heng Fan

2023-12-04

Abstract: Multimodal (e.g., RGB-Depth/RGB-Thermal) fusion has shown great potential for improving semantic segmentation in complex scenes (e.g., indoor/low-light conditions). Existing approaches often fully fine-tune a dual-branch encoder-decoder framework with a complicated feature fusion strategy to achieve multimodal semantic segmentation, which is costly to train because of the massive parameter updates in feature extraction and fusion. To address this issue, we propose a surprisingly simple yet effective dual-prompt learning network (dubbed DPLNet) for training-efficient multimodal (e.g., RGB-D/T) semantic segmentation. The core of DPLNet is to directly adapt a frozen pre-trained RGB model to multimodal semantic segmentation, reducing parameter updates. For this purpose, we present two prompt learning modules: a multimodal prompt generator (MPG) and a multimodal feature adapter (MFA). MPG fuses the features from different modalities in a compact manner and is inserted from shallow to deep stages to generate multi-level multimodal prompts that are injected into the frozen backbone, while MFA adapts the prompted multimodal features in the frozen backbone for better multimodal semantic segmentation. Since both MPG and MFA are lightweight, only a few trainable parameters (3.88M, 4.4% of the pre-trained backbone parameters) are introduced for multimodal feature fusion and learning. Using a simple decoder (3.27M parameters), DPLNet achieves new state-of-the-art performance or is on a par with other complex approaches on four RGB-D/T semantic segmentation datasets while remaining parameter-efficient. Moreover, we show that DPLNet is general and applicable to other multimodal tasks such as salient object detection and video semantic segmentation. Without special design, DPLNet outperforms many complicated models.
Our code will be available at http://github.com/ShaohuaDong2021/DPLNet.
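
The parameter figures quoted in the abstract can be put side by side with a short calculation. The sketch below uses only the three numbers stated above (3.88M, 4.4%, 3.27M). The backbone size it derives is an estimate implied by the rounded 4.4% share, not a figure reported in the abstract, and the function names are hypothetical helpers introduced for illustration.

```python
# Figures quoted in the abstract, in millions of parameters.
PROMPT_PARAMS_M = 3.88    # trainable parameters in MPG + MFA
PROMPT_SHARE = 0.044      # 4.4% of the pre-trained backbone parameters
DECODER_PARAMS_M = 3.27   # parameters of the simple decoder


def implied_backbone_params_m(prompt_params_m: float, share: float) -> float:
    """Backbone size implied by 'prompt modules are `share` of the backbone'.

    Assumption: the 4.4% figure is rounded, so this is only an estimate.
    """
    return prompt_params_m / share


def total_trainable_params_m(prompt_params_m: float, decoder_params_m: float) -> float:
    """Parameters that are updated in training: prompt modules plus decoder."""
    return prompt_params_m + decoder_params_m


def trainable_fraction(prompt_params_m: float, decoder_params_m: float,
                       backbone_params_m: float) -> float:
    """Share of all parameters that are trainable.

    Assumption: the full model is backbone + prompt modules + decoder.
    """
    trainable = prompt_params_m + decoder_params_m
    return trainable / (backbone_params_m + trainable)


backbone_m = implied_backbone_params_m(PROMPT_PARAMS_M, PROMPT_SHARE)
trainable_m = total_trainable_params_m(PROMPT_PARAMS_M, DECODER_PARAMS_M)
fraction = trainable_fraction(PROMPT_PARAMS_M, DECODER_PARAMS_M, backbone_m)

print(f"implied frozen backbone: ~{backbone_m:.1f}M parameters")
print(f"trainable (prompts + decoder): {trainable_m:.2f}M parameters")
print(f"trainable share of the whole model: ~{fraction:.1%}")
```

Under these assumptions the frozen backbone comes out at roughly 88M parameters, against about 7.15M trainable parameters in total.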

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses multimodal (e.g., RGB-Depth or RGB-Thermal) semantic segmentation, particularly in complex scenarios such as indoor environments or low-light conditions. Existing methods typically fully fine-tune a dual-branch encoder-decoder framework with a complex feature fusion strategy. This requires a large number of parameter updates during feature extraction and fusion, leading to high training costs.

To address this, the authors propose DPLNet (Dual-Prompt Learning Network). The core idea of DPLNet is to directly adapt a frozen pre-trained RGB model to the multimodal semantic segmentation task, reducing training costs by minimizing parameter updates. Specifically, DPLNet includes two key components:

1. **Multimodal Prompt Generator (MPG)**: Fuses features from different modalities in a compact manner and is inserted from shallow to deep stages, generating multi-level multimodal prompts that are injected into the frozen backbone network.
2. **Multimodal Feature Adapter (MFA)**: Adapts the prompted multimodal features within the frozen backbone network, using a small number of trainable parameters to improve multimodal feature learning.

Since both MPG and MFA are lightweight, DPLNet introduces only a small number of trainable parameters for fusion and feature learning (3.88M, 4.4% of the pre-trained backbone parameters, according to the abstract), which keeps the number of parameters updated during training small.

Additionally, the paper highlights several advantages of DPLNet over existing methods:

- Training Efficiency: Only a few parameters need to be adjusted.
- Deployment Friendly: No need to retain dual encoders, reducing the deployment burden in practical applications.
- Unified Framework: Suitable for various multimodal semantic segmentation tasks, avoiding the need to design complex models for each task.

The experimental section validates the effectiveness of DPLNet on several challenging datasets, including NYUD-v2, SUN-RGBD, MFNet, and PST900, demonstrating that the method can match or exceed the performance of other complex methods while maintaining parameter efficiency.
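
The data flow described above (a frozen backbone, a prompt generator at each stage, and an adapter on the prompted features) can be illustrated with a toy sketch. This is not the authors' implementation: features are plain lists of floats, each "stage" is a single fixed scaling factor, and the class names, the additive prompt injection, and the reuse of the frozen stage for the auxiliary modality are all assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import List

Vector = List[float]


@dataclass
class FrozenStage:
    """Stand-in for one stage of the pre-trained RGB backbone (never updated)."""
    scale: float
    trainable: bool = False

    def params(self) -> List[float]:
        return [self.scale]

    def __call__(self, x: Vector) -> Vector:
        return [self.scale * v for v in x]


@dataclass
class PromptGenerator:
    """Toy MPG: fuses RGB and auxiliary features into one prompt (assumed form)."""
    w_rgb: float = 0.5
    w_aux: float = 0.5
    trainable: bool = True

    def params(self) -> List[float]:
        return [self.w_rgb, self.w_aux]

    def __call__(self, rgb: Vector, aux: Vector) -> Vector:
        return [self.w_rgb * r + self.w_aux * a for r, a in zip(rgb, aux)]


@dataclass
class FeatureAdapter:
    """Toy MFA: a residual adjustment of the prompted features (assumed form)."""
    w: float = 0.25
    trainable: bool = True

    def params(self) -> List[float]:
        return [self.w]

    def __call__(self, x: Vector) -> Vector:
        return [v + self.w * v for v in x]


def forward(rgb: Vector, aux: Vector, stages, generators, adapters) -> List[Vector]:
    """Run every stage: build a prompt, inject it, apply the frozen stage, adapt."""
    features = []
    for stage, mpg, mfa in zip(stages, generators, adapters):
        prompt = mpg(rgb, aux)
        # Assumption: the prompt is injected by element-wise addition.
        prompted = [r + p for r, p in zip(rgb, prompt)]
        rgb = mfa(stage(prompted))
        # Assumption: the auxiliary modality passes through the same frozen stage.
        aux = stage(aux)
        features.append(rgb)
    return features


def count_params(modules, trainable: bool) -> int:
    """Count parameters in modules whose trainable flag matches."""
    return sum(len(m.params()) for m in modules if m.trainable == trainable)


stages = [FrozenStage(2.0), FrozenStage(0.5)]
generators = [PromptGenerator(), PromptGenerator()]
adapters = [FeatureAdapter(), FeatureAdapter()]
modules = stages + generators + adapters

features = forward([1.0, 2.0], [3.0, 4.0], stages, generators, adapters)
print("multi-level features:", features)
print("trainable parameters:", count_params(modules, trainable=True))
print("frozen parameters:", count_params(modules, trainable=False))
```

Only the prompt generators and adapters are flagged as trainable, so an optimizer restricted to those modules would leave the backbone stages untouched, which is the property the summary attributes to DPLNet. In the toy the trainable modules outnumber the frozen ones; only the flags and the data flow are meant to carry over, not the ratio.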

Deep Dual-Stream Network with Scale Context Selection Attention Module for Semantic Segmentation

CMPFFNet: Cross-Modal and Progressive Feature Fusion Network for RGB-D Indoor Scene Semantic Segmentation

Towards Semi-supervised Dual-modal Semantic Segmentation

MEDANet: More Efficient Dual Attention Network for Scene Segmentation

Attention-based Dual Supervised Decoder for RGBD Semantic Segmentation

DHFNet: Decoupled Hierarchical Fusion Network for RGB-T dense prediction tasks

DARSegNet: A Real-Time Semantic Segmentation Method Based on Dual Attention Fusion Module and Encoder-Decoder Network

Dual-Path Feature Fusion Network for Semantic Segmentation of Remote Sensing Images

DDFL: Dual-Domain Feature Learning for Nighttime Semantic Segmentation

Mitigating Modality Discrepancies for RGB-T Semantic Segmentation

Centering the Value of Every Modality: Towards Efficient and Resilient Modality-agnostic Semantic Segmentation

DPANET: Dual Pooling Attention Network for Semantic Segmentation

Prompt-Matched Semantic Segmentation

Deep Dual-resolution Networks for Real-time and Accurate Semantic Segmentation of Road Scenes

Double Similarity Distillation for Semantic Image Segmentation

LMFNet: An Efficient Multimodal Fusion Approach for Semantic Segmentation in High-Resolution Remote Sensing

Optimizing rgb-d semantic segmentation through multi-modal interaction and pooling attention

Real-time efficient semantic segmentation network based on improved ASPP and parallel fusion module in complex scenes

Efficient Context Integration through Factorized Pyramidal Learning for Ultra-Lightweight Semantic Segmentation

DPL: Decoupled Prompt Learning for Vision-Language Models